{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Set and get hyperparameters in scikit-learn\n",
    "\n",
    "Recall that hyperparameters refer to the parameters that control the learning\n",
    "process of a predictive model and are specific for each family of models. In\n",
    "addition, the optimal set of hyperparameters is specific to each dataset and\n",
    "thus they always need to be optimized.\n",
    "\n",
    "This notebook shows how one can get and set the value of a hyperparameter in a\n",
    "scikit-learn estimator.\n",
    "\n",
    "They should not be confused with the fitted parameters, resulting from the\n",
    "training. These fitted parameters are recognizable in scikit-learn because\n",
    "they are spelled with a final underscore `_`, for instance `model.coef_`.\n",
    "\n",
    "We start by loading the adult census dataset and only use the numerical\n",
    "features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
    "\n",
    "target_name = \"class\"\n",
    "numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
    "\n",
    "target = adult_census[target_name]\n",
    "data = adult_census[numerical_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our data is only numerical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a simple predictive model made of a scaler followed by a logistic\n",
    "regression classifier.\n",
    "\n",
    "As mentioned in previous notebooks, many models, including linear ones, work\n",
    "better if all features have a similar scaling. For this purpose, we use a\n",
    "`StandardScaler`, which transforms the data by rescaling features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "model = Pipeline(\n",
    "    steps=[\n",
    "        (\"preprocessor\", StandardScaler()),\n",
    "        (\"classifier\", LogisticRegression()),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can evaluate the generalization performance of the model via\n",
    "cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "cv_results = cross_validate(model, data, target)\n",
    "scores = cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"Accuracy score via cross-validation:\\n\"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We created a model with the default `C` value that is equal to 1. If we wanted\n",
    "to use a different `C` hyperparameter we could have done so when we created the\n",
    "`LogisticRegression` object with something like `LogisticRegression(C=1e-3)`.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">For more information on the model hyperparameter <tt class=\"docutils literal\">C</tt>, refer to the\n",
    "<a class=\"reference external\" href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\">documentation</a>.\n",
    "Be aware that we will focus on linear models in an upcoming module.</p>\n",
    "</div>\n",
    "\n",
    "We can also change the hyperparameter of a model after it has been created\n",
    "with the `set_params` method, which is available for all scikit-learn\n",
    "estimators. For example, we can set `C=1e-3`, fit and evaluate the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.set_params(classifier__C=1e-3)\n",
    "cv_results = cross_validate(model, data, target)\n",
    "scores = cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"Accuracy score via cross-validation:\\n\"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the model of interest is a `Pipeline`, the hyperparameter names are of\n",
    "the form `<model_name>__<hyperparameter_name>` (note the double underscore in\n",
    "the middle). In our case, `classifier` comes from the `Pipeline` definition\n",
    "and `C` is the hyperparameter name of `LogisticRegression`.\n",
    "\n",
    "In general, you can use the `get_params` method on scikit-learn models to list\n",
    "all the hyperparameters with their values. For example, if you want to get all\n",
    "the hyperparameter names, you can use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for parameter in model.get_params():\n",
    "    print(parameter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`.get_params()` returns a `dict` whose keys are the hyperparameter names and\n",
    "whose values are the hyperparameter values. If you want to get the value of a\n",
    "single hyperparameter, for example `classifier__C`, you can use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.get_params()[\"classifier__C\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can systematically vary the value of C to see if there is an optimal\n",
    "value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for C in [1e-3, 1e-2, 1e-1, 1, 10]:\n",
    "    model.set_params(classifier__C=C)\n",
    "    cv_results = cross_validate(model, data, target)\n",
    "    scores = cv_results[\"test_score\"]\n",
    "    print(\n",
    "        f\"Accuracy score via cross-validation with C={C}:\\n\"\n",
    "        f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that as long as C is high enough, the model seems to perform well.\n",
    "\n",
    "What we did here is very manual: it involves scanning the values for C and\n",
    "picking the best one manually. In the next lesson, we will see how to do this\n",
    "automatically.\n",
    "\n",
    "<div class=\"admonition warning alert alert-danger\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
    "<p class=\"last\">When we evaluate a family of models on test data and pick the best performer,\n",
    "we can not trust the corresponding prediction accuracy, and we need to apply\n",
    "the selected model to new data. Indeed, the test data has been used to select\n",
    "the model, and it is thus no longer independent from this model.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we have seen:\n",
    "\n",
    "- how to use `get_params` and `set_params` to get the hyperparameters of a model\n",
    "  and set them."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}